How to Explain Individual Classification Decisions 



How to Explain Individual Classification Decisions 



O 

o 

(N 
O 

Q 



> 
OO 
(N 



David Baehrens* 
Timon Schroeter* 

Technische Universitat Berlin 

Franklinstr. 28/29, FR 6-9 

10587 Berlin, Germany 

Stefan Harmeling* 

MPI for Biological Cybernetics 

Spemannstr. 38 

12016 Tubingen, Germany 

Motoaki Kawanabe 

Fraunhofer Institute FIRST. IDA 

Kekulestr. 1 

12489 Berlin, Germany 
and 

Technische Universitat Berlin 
Franklinstr. 28/29, FR 6-9 
10581 Berlin, Germany 
Katja Hansen 
Klaus-Robert Miiller 
Technische Universitat Berlin 
Franklinstr. 28/29, FR 6-9 
10581 Berlin, Germany 



baehrensl9jcs.tu-berlin.de 
timon@cs.tu-berlin.de 



stefan.harmeling@tuebingen.mpg.de 



motoaki.kawanabe@first.fraunhofer.de 



khansen (9cs.tu-berlin.de 
klaus-robert.mueller@tu-berlin.de 



(N 
On 

o 



Editor: Carl Edward Rasmussen 



X 



Abstract 

After building a classifier with modern tools of machine learning we typically have a black 
box at hand that is able to predict well for unseen data. Thus, we get an answer to the 
question what is the most likely label of a given unseen data point. However, most methods 
will provide no answer why the model predicted the particular label for a single instance 
and what features were most influential for that particular instance. The only method 
that is currently able to provide such explanations are decision trees. This paper proposes 
a procedure which (based on a set of assumptions) allows to explain the decisions of any 
classification method. 
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1. Introduction 

Automatic nonlinear classification is a common and powerful tool in data analysis. Machine 
learning research has created methods that are practically useful and that can classify unseen 
data after being trained on a limited training set of labeled examples. 

Nevertheless, most of the algorithms do not explain their decision. However in practical 
data analysis it is essential to obtain an instance based explanation, i.e. we would like to 
gain an understanding what input features made the nonlinear machine give its answer for 
each individual data point. 

Typically, explanations are provided jointly for all instances of the training set, for ex- 
ample feature selection methods (including Automatic Relevance De termination) find out 



which inputs are salient for a good generalization (see for a review iGuvon and Elisseefi 



While this can give a coarse impression about the global usefulness of each in- 
put dimension, it is still an ensemble view and does not provide an answer on an in- 
stance basis In the neural netwo rk litera t ure a l so solely an ensemble view was taken i n 
algorithms like input pruning (e.g. Bishop . 1995 ; LeCun. Bottom Qrr. and Muller . 19981 ). 



The only classification which does prov ide individual explanations are decision trees (e.g. 
Hastie. Tibshirani. and Friedman! 200ll ). 



This paper proposes a simple framework that provides local explanation vectors applica- 
ble to any classification method in order to help understanding prediction results for single 
data instances. The local explanation yields the features being relevant for the prediction 
at the very points of interest in the data space and is able to spot local peculiarities which 
are neglected in the global view e.g. due to cancellation effects. 

The paper is organized as follows: We define local explanation vectors as class probability 
gradients in Section [2] and give an illustration for Gaussian Process Classification (GPC). 
Some methods output a prediction without a direct probability interpretation. For these 
we propose in Section [3] a way to estimate local explanations. In Section [4] we will apply 
our methodology to learn distinguishing properties of Iris flowers by estimating explanation 
vectors for a k-NN classifier applied to the classic Iris data set. Section [5] will discuss how our 
approach applied to a SVM classifier allows us to explain how digits "two" are distinguished 
from digit "8" in the USPS data set. In Section [6] we discuss a more real- world application 
scenario where the proposed explanation capabilities prove useful in drug discovery: Human 
experts regularly decide how to modify existing lead compounds in order to obtain new 
compounds with improved properties. Models capable of explaining predictions can help in 
the process of choosing promising modifications. Our automatically generated explanations 
match with chemical domain knowledge about toxifying functional groups of the compounds 
in question. Section [7] contrasts our approach with related work and Section [8] discusses 
characteristic properties and limitations of our approach, before we conclude the paper in 
Section [9l 



This point is illustrated in Figure [T] (Section [2}. Applying feature selection methods to the training 
set (a) will lead to the (correct) conclusion that both dimensions are equally important for accurate 
classification. As an alternative to this ensemble view, one may ask: Which features (or combinations 
thereof) are most influential in the vicinity of each particular instance. As can be seen in Figure [T] (c), 
the answer depends on where the respective instance is located. On the hypotenuse and at the corners 
of the triangle, both features contribute jointly, whereas along each of the remaining two edges the 
classification depends almost completely on just one of the features. 
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2. Definitions of Explanation Vectors 

In this Section we will give definitions for our approach of local explanation vectors in the 
classification setting. We start with a theoretical definition for multi-class Bayes classifica- 
tion and then give a specialized definition being more practical for the binary case. 

For the multi-class case, suppose we are given data points xi, . . . ,x n G with labels 
2/1; ■ ■ ■ 5 Un £ {1, • • • ) C} and we intend to learn a function that predicts the labels of unlabeled 
data points. Assuming that the data could be modeled as being IID-sampled from some 
unknown joint distribution P(X,Y), in theory, we can define the Bayes classifier, 

g*(x) = arg min P(Y^c \ X = x) 

c£{l,...,C} 



which is optimal for the 0-1 loss function (see Devrove. Gvorfi. and Lugosi . 19961 ) 



For the Bayes classifier we define the explanation vector of a data point xq to be the 
derivative with respect to x at x = xo of the conditional probability of Y ^g*(xo) given 
X = x, or formally, 

Definition 1 

C(so) := ^ P(Y^g*(x ) \ X = x) 

Note that C( x o) is a rf-dimensional vector just like xq is. The classifier g* partitions the 
data space !Ji rf into up to C parts on which g* is constant. We assume that the conditional 
distribution P(Y = c \ X = x) is first-order differentiable w.r.t. x for all classes c and over 
the entire input space. For instance, the assumption holds, if P(X = x \ Y = c) is for all 
c first-order differentiable in x and the supports of the class densities overlap around the 
boarder for all the neighboring pairs in the partition by the Bayes classifier. The vector 
C(xq) defines on each of those parts a vector field that characterizes the flow away from the 
corresponding class. Thus entries in ((xo) with large absolute values highlight features that 
will influence the class label decision of xq. A positive sign of such an entry implies that 
increasing that feature would lower the probability that xq is assigned to g*{xo). Ignoring the 
orientations of the explanation vectors, £ forms a continuously changing (orientation-less) 
vector field along which the class labels change. This vector field lets us locally understand 
the Bayes classifier. 

We remark that CO^o) becomes a zero vector, e.g. when P(Y ^ g* (xo) \ X = x)\ x=XQ is 
equal to one in some neighborhood of xq. Our explanation vector fits well to probabilistic 
classifiers such as Gaussian Process Classification (GPC), where the conditional distribution 
P(Y = c | X = x) is usually not completely flat in some regions. In the case of deterministic 
classifiers, despite of this issue, Parzen window estimators with appropriate widths (Section 
[3]) can provide meaningful explanation vectors for many samples in practice (see also Section 
E}. 

For the case of binary classification we directly define local explanation vectors as local 
gradients of the probability function p(x) = P(Y = 1 | X = x) of the learned model for the 
positive class. 

So for a probability function p : 5R d — > [0, 1] of a classification model learned from 
examples {(x\, yi), . . . , (x n , y n )} € K d x {—1, +1} the explanation vector for a classified test 
point xq is the local gradient of p at xq: 
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Definition 2 



i]p(x ) := Vp(x)\ 



X=XQ 



By this definition the explanation i] is again a d-dimensional vector just like the test 
point xo is. The sign of each of its individual entries indicates whether the prediction would 
increase or decrease when the corresponding feature of xq is increased locally and each 
entry's absolute value give the amount of influence in the change in prediction. As a vector 
r\ gives the direction of the steepest ascent from the test point to higher probabilities for the 
positive class. For binary classification the negative version — t] v {xq) indicates the changes 
in features needed to increase the probability for the negative class which may be especially 
useful for xq predicted in the positive class. 

For an example we ap ply Definition [2] to model predict ions learned by Gaussian Process 
Classification (GPC), see Rasmussen and Williamafeood ). GPC is used here for three rea- 
sons: 

(i) In our real-world application we are interested in classifying data from drug discovery, 



see e.e. Obrezanova, Csanvi. Gola. and Seea 


] (2007); Schroeter. Schwaiehofer. Mika. ter Laak. Siilzle. Ganzer. 


(2007c): Schroeter. Schwaiehofer. Mika. Laak. Suelzle. Ganzer. Heinrich, and Muller 


(2007alb): 


Schwaiehofer. Schroeter. Mika. Laub. ter Laak. Siilzle. Ganzer. Heinrich. and Miillei 


(120071): 


Schwaiehofer. Schroeter. Mika. Hansen, ter Laak. Lienau. Reichel. Heinrich. and Muller (2008): 


Obrezanova. Gola. Champness. and See:al] 


2008). It is natural to expect a model with high 



prediction accuracy on a complex problem to capture relevant structure of the data which is 
worth explaining and may give domain specific insights in addition to the values predicted. 
For an evaluation of the explaining capabilities of our approach on a complex problem from 
chemoinformatics see Section [6l 

(ii) GPC does model the class probability function used in Definition [2] directly. For other 
classification methods such as Support Vector Machines which do not provide a probability 
function as its output in Section [3] we give an example for an estimation method starting 
from Definition [TJ 

(iii) The local gradients of the probability function can be calculated analytically for differ- 
entiable kernel as we discuss next. 

Let f(x) = Ya=i otik{x,Xi) be a GP model trained on sample points x±, ... ,x n S 
where A; is a kernel function and cti are the learned weights of each sample point. For a test 
point xq G K d let vaif(xo) be the variance of /(xq) under the GP posterior for /. Because 
the posterior cannot be calculated analytically for GP classification models , we used an 
approximation by expectation propagation (EP) ( Kuss and Ramussen . 20051 ) . In the case 



of the probit likelihood term defined by the error function, the probability for being of the 
positive class p(xo) can be computed easily from this approximated posterior as 



wher e erfc denotes the complementary error function (see Equation 6 in lSchwaighofer. Schroeter. Mika. Hansen. 
20081 ). 

Then the local gradient of p(xq) is given by 
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Vp(x) 



X=XQ 



Vlerfc - ZM 
2 I y/l + var f (a 



X=XQ 



2\ I v^* V 1 + var /( x ) 



2 I y^* a/1 + var/(a 



ex P Ui+™r/W) 



V 



-/(*) 



cxp 



7T 

-7(*o) 2 

2(l+va r/ (a:o)) 



V2 * \/l + var j (a 



/(*) 



cxp 



-7(^o) 2 



2(l+var/(xo)) 



I \/2 I -y/l + var j (x 
V7(»)|x=x 



cxp 



2vr 

-7(*o) 2 

2(l+var/(x )) 



a/1 + var/(x ) 
V7(x)| x=xo 



+ /(^o) ( V var / (x)| x=a;o * -- (1 + var/(x )y 
3-Vvar / (x)U =X0 J . 



\y/l + var f (x ) 2 (1 +var/(x ))2 

As a kernel function choose e.g. the RBF-kernel k(xo,xi) = exp(—w(xQ — xi) 2 ), which has 
the derivative (d/dxoj)k(xQ,xi) = — 2w exp(— w(x — xi) 2 )(x j — Xij) for j £ {1,... , d}. 
Then the elements of the local gradient V/(x) 



X=XQ 



arc 



df 

dx . 



-2ti;^a i exp(-'u;(xo - Xi) 2 )(x 0i j - x iyj ) for j G {!,... ,d}. 



i=l 



For varj(xo) = /c(xo,xo) — kJ(K + S) *A;* the derivative is given bjH 
dvary 



Vvar r (x 



f\X)\x=XQ 



8xq 



dx 



k(x ,x ) ) -2*^(^ + S)" 1 -^-^ fori € {l,-..,4- 



0,7 



5xq 



Panel (a) of Figure [T] shows the training data of a simple object classification task 
and panel (b) shows the model learned using GPCd. The data is labeled —1 for the blue 
points and +1 for the red points. As illustrated in panel (b) the model is a probability 
function for the positive class which gives every data point a probability of being in this 



2. Here fc* = (k(xo, x\), . . . , k(xo, x n )) T is the evaluation of the kernel function between the test point 
xp and every training po i nt. S is the diagonal matrix of the variance site parameter. For details see 
iRasmussen and William^ l)2006l . Chapter 3) 

3. Hyperparameters were tuned by a gradient ascend on the evidence. 
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(a) Object (b) Model 




(c) Local explanation vectors (d) Direction of explanation vectors 



Figure 1: Explaining simple object classification with Gaussian Processes 

class. Panel (c) shows the probability gradient of the model together with the local gradient 
explanation vectors. On the hypotenuse and at the corners of the triangle explanations from 
both features interact towards the triangle class while along the edges the importance of 
one of the two feature dimensions singles out. At the transition from the negative to the 
positive class the length of the local gradient vectors represents the increased importance 
of the relevant features. In panel (d) we see that explanations close to the edges of the 
plot (especially in the right hand side corner) point away from the positive class. However, 
panel (c) shows that their magnitude is very small. For discussion of this issue, see Section 

El 

3. Estimating Explanation Vectors 

Several classifier methods estimate directly the decision rule, which often has no interpre- 
tation as a probability function which is used in our Definition [2] in Section [2j For example 
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Support Vector Machines estimate a decision function of the form 

n 

f( x ) = ^2 / a i k{xi,x) + b, 
i=i 

cti,b £ 5i. Suppose we have two classes (each with one cluster) in one dimension (see 
Figure [2]) and train a SVM with RBF kernel. For points outside the data clusters f(x) 
tends to zero. Thus, the derivative of f(x) (shown as arrows above the curves) for points 
on the very left or on the very right side of the axis will point to the wrong side. In the 









classifier output 







+ +-HH-+ + + O O OOSDO o 




Figure 2: Classifier output of an SVM (top) compared to p(y = l\x) (bottom). 



following, we will explain how explanations can be obtained for such classifiers. 

In practice we do not have access to the true underlying distribution P(X,Y). Con- 
sequently, we have no access to the Bayes classifier as defined in Section [2l Instead we 
apply sophisticated lea r ning machinery like Support Vector Machin es (tvapniki . 1199.4 



can 



Scholkopf and Smolal . l2002l : iMuller. Mika. Ratsch. Tsuda. and Scholkopj 12001 



that esti- 
mates some classifier g that tries to mimic g*. For test data points z\, . . . ,z m £ 5R d which 
are assumed to be sampled from the same unknown distribution as the training data, g 
estimates labels g(zi), . . . ,g(z m ). Now, instead of trying to explain g* to which we have 
no access, we will define explanation vectors that help us understand the classifier g on the 
test data points. 

Since we do not assume that we have access to some intermediate real-valued classifier 
output here (of which g might be a thresholded version and which further might not be 
an estimate of P(Y = c \ X = x)), we suggest to approximate g by another classifier g the 
actual form of which resembles the Bayes classifier. There are several choices for g, e.g. 
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GPC, logistic regression and Parzen windowsQ In this paper we apply Parzen windows to 
the training points to estimate the weighted class densities P(Y = c) ■ P(X \ Y = c), 

p a (x,y = c) = - y~] k a (x - Xi) (1) 
iei c 

for the index set I c = {i \ g(xi) = c} and with k a {z) being a Gaussian kernel k a (z) = 
exp(— 0.5 z'z /a 2 )/V 2na 2 (as always other kernels are also possible). This estimates P(Y= 
c \ X = x) for all c, 

p a (x,y = c) + p a (x,y^c) ^ h„(x - Xi), 

and thus an estimate of the Bayes classifier (that mimics g) , 

g a {x) = arg min p a (y^c\x). 
ce{i,...,C} 

This approach has the advantage, that we can use our estimated classifier g to generate 
any amount of labeled data for for constructing g. The single hyper-parameter a is chosen, 
such that g approximates g (which we want to explain), i.e. 

m 

<5-:=argminV I {g(zj)^g a (zj)} , 

a * * 

3=1 

where /{•••} is the indicator function, a is assigned the constant value a from here on and 
omitted as a subscript. For g it is straightforward to define explanation vectors: 

Definition 3 

d ...... , (E«/, w *(«-^))(Ei6/, w *(*-^)(«-^)) 

C{z) ■= g- p{y^g{z) I a- 



cr 2 [YJLiHz - Xi)) 



V 2 {Ya=iK z ~ x i)) 2 

which is easily derived using Eq. ([2]) and the derivative of Eq. ([1]), see Appendix lA.2.11 Note 
that we use g instead of g. This choice ensures that the orientation of C(z) fits to the labels 
assigned by g, which allows better interpretations. 

In summary, we imitate the classifier g which we would like to explain locally, by a Parzen 
window classifier g that has the same form as the Bayes estimator and for which we can 
thus easily estimate the explanation vectors using Definition [3l Practically there are some 
caveats: the mimicking classifier g has to be estimated from g even in high dimensions; this 
needs to be done with care. However, in principle we have an arbitrary amount of training 
data available for constructing g since we may use our estimated classifier g to generate 
labeled data. 



4. For Support Vector Machines IPIattl (jl999h fits a sigmoid function to map the outputs to probabilities. 
In the following, we will present a more general method for estimating explanation vectors. 
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4. Explaining Iris Flower Classification by fc-Nearest Neighbors 

The Iris flower data set (introduced in Fisher . 19361 ) describes 150 flowers from the genus 
Iris by 4 features: sepal length, sepal width, petal length, and petal width, which are easily 
measured properties of certain leaves of the corolla of the flower. There are three clusters 
in that data which correspond to three different species: Iris setosa, Iris virginica, and Iris 
versicolor. 

Let us consider the problem of classifying the data points of Iris versicolor (class 0) 
against the other two species (class 1). We applied some standard classification machinery 
to this problem as detailed in the following: 



• Class consists of all examples of Iris versicolor. 

• Class 1 consists of all examples of Iris setosa and Iris virginica. 

• Randomly split 150 data points into 100 training and 50 test examples. 

• Normalize training and test set using the mean and variance of the training set. 

• Apply A;-nearest neighbor classification with k = 4 (chosen by leave-one-out cross 
validation on the training data). 

• Training error is 3% (i.e. 3 mistakes in 100). 



• Test error is 8% (i.e. 4 mistakes in 50). 



In order to estimate explanation vectors we mimic the classification results with a Parzen 
window classifier. The best fit (3% error) is obtained with a kernel width of a = 0.26 
(chosen by leave-one-out cross validation on the training data). 

Since the explanation vectors live in the input space we can visualize them with scatter 
plots of the initially measured features. The resulting explanations (i.e. vectors) for the test 
set are shown in Figure [3l The blue dots correspond to explanation vectors for Iris setosa 
and the red dots for Iris virginica (both class 1). Both groups of dots point to the green 
dots of Iris versicolor. The most important feature is the combination of petal length and 
petal width (see the corresponding panel), the product of which corresponds roughly to the 
area of the petals. However, the resulting explanations for the two species in class 1 are 
different: 



• Iris setosa (class 1) is different from Iris versicolor (class 0) because its petal area is 
smaller. 



• Iris virginica (class 1) is different from Iris versicolor (class 0) because its petal area 
is larger. 

Also the dimensions of the sepal (another part of the blossom) is relevant, but not as 
distinguishing. 
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Figure 3: Scatter plots of the explanation vectors for the test data. Shown are all explana- 
tion vectors for both classes: class 1 containing Iris setosa (shown in blue) and Iris 
virginica (shown in red) versus class containing only one species Iris versicolor 
(shown in green). Note that the explanations why an Iris flower is not an Iris 
versicolor is different for Iris setosa and Iris virginica. 
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Figure 4: USPS digits (training set): 'twos' (left) and 'eights' (right) with correct classifi- 
cation. For each digit from left to right: (i) explanation vector (with black being 
negative, white being positive), (ii) the original digit, (iii-end) artificial digits 
along the explanation vector towards the other class. 
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Figure 5: USPS digits (test set bottom part): 'twos' (left) and 'eights' (right) with correct 
classification. For each digit from left to right: (i) explanation vector (with black 
being negative, white being positive), (ii) the original digit, (iii-end) artificial 
digits along the explanation vector towards the other class. 
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5. Explaining USPS Digit Classification by Support Vector Machine 

We now apply the framework of estimating explanation vectors to a high dimensional data 
set, the USPS digits. The classification problem that we designed for illustration purposes 
is detailed in the following list: 

• all digits are 16 x 16 images which are reshaped to 256 x 1 dimensional column vectors 

classifier: SVM from Schwaighofer ( 20021 ) with RBF- kernel- width a = 1 and regu- 



larization constant C = 10 (chosen by grid search in cross validation on the training 
data). 

• training set: 47 'twos', 53 'eights'; training error 0.00 

• test set: 48 'twos', 52 'eights'; test error 0.05 

We approximated the estimated class labels obtained by the SVM with the Parzen window 
classifier (Parzen window size a = 10.2505, chosen by grid search in cross validation on the 
training data) . The SVM and the Parzen window classifier only disagreed on 2% of the test 
examples, so a good fit was achieved. Figures H] and [5] show our results. All parts show three 
examples per row. For each example we display from left to right: (i) the explanation vector, 
(ii) the original digit, (iii-end) artificial digits along the explanation vector towards the other 
classll These artificial digits should help to understand and interpret the explanation vector. 
Let us first have a look at the results on the training set: 

Figure [4] (left panel): Let us focus on the top example framed in red. The line that forms 
the 'two' is part of some 'eight' from the data set. Thus the parts of the lines that 
are missing show up in the explanation vector: if the dark parts (which correspond to 
the missing lines) are added to the 'two' digit then it will be classified as an 'eight'. 
Or in other words, because of the lack of those parts the digit was classified as a 'two' 
and not as an 'eight'. A similar explanation holds for the middle example framed in 
red of the same Figure. Not all examples transform easily to 'eights': besides adding 
parts of black lines, some existing black spots (that the digit has to be a 'two') must 
be removed. This is reflected in the explanation vector by white spots/lines. Curious 
is the bottom 'two' framed in red, which is actually a dash and is in the data set by 
mistake. However, its explanation vector shows nicely which parts have to be added 
and which have to be removed. 

Figure [4] (right panel): we see similar results for the 'eights' class. The explanation 
vectors again tell us how the 'eights' must change to become classified as 'twos'. 
However, sometimes the transformation does not reach the 'twos'. This is probably 
due to the fact that some of the 'eights' are inside the cloud of 'eights'. 

On the test set the explanation vectors are not as pronounced as on the training set. 
However, they show similar tendencies: 



5. For the sake of simplicity, no intermediate updates were performed, i.e. artificial digits were generated 
by taking equal sized steps in the direction given by the original explanation vector calculated fot the 
original digit. 
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Figure [5] (left panel): we see the correctly classified 'twos'. Let's focus on the example 
framed in red. Again the explanation vector shows us how to edit the image of the 
'two' to make it some of the 'eights', i.e. exactly what parts of the digit have been 
important for the classification result. For several other 'twos' the explanation vectors 
do not directly lead to the 'eights' but weight the different parts of the digits which 
have been relevant for the classification. 

Figure [5] (right panel): similarly to the training data, we see that also these explanation 
vectors are not bringing all 'eights' to 'two'. Their explanation vectors mainly suggest 
to remove most of the eights (the dark parts) and add some in the lower part (the 
light parts, which look like a white shadow). 

Overall, our findings can be summarized, that the explanation vectors tell us how to edit 
our example digits to change the assigned class label. Hereby, we get a better understanding 
of the reasons why the chosen classifier classified the way it did. 



6. Explaining Mutagenicity Classification by Gaussian Processes 

In the following Section we describe an application of our local gradient explanation method- 
ology to a complex real world data set. Our aim is to find structure specific to the problem 
domain that has not been fed into training explicitly but is captured implicitly by the 
GPC model in the high-dimensional feature space used to determine its prediction. We 
investigate the task of predicting Ames mutagenic activity of chemical compounds. Not 
being mutagenic (i.e. not able to cause mutations in the DNA) is an important require- 
ment for compounds under investigation in drug discovery and design. The Ames test 
( Ames. Gurnev. Miller, and Bartsch . 19721 ) is a standard experimental setup for measuring 



mutagenicity. The following experiments are based on a set of Ames test results for 6512 
chemical compounds that we published previously!! 
GPC was applied as detailed in the following: 

• Class consists of non-mutagenic compounds 

• Class 1 consists of mutagenic compounds 
Randomly split 6512 data points into 2000 training and 4512 test examples such that: 

— The training set consists of equally many class and class 1 examples. 

— For the steroid compound class the balance in the train and test set is enforced. 

10 additional random splits were investigated individually. This confirmed the results 
presented below. 

Each example (chemical compound) is represented by a vector of counts of 142 molecu- 



lar su bstructures calculated using the Dragon software (jTodeschini. Consonni. Mauri, and Pavan 
200d ). 



6. See lHansen. Mika. Schroeter. Sutter. Laak. Steger-Hartmann. Heinrich. and Mulled (|2009h for results of 
modeling this set using different machine learning methods. The data itself is available online at 
|http : //ml . cs . tu-berlin . de/toxb enchmark 
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Figure 6: Receiver operating curve of GPC model for mutagenicity prediction 



Normalize training and test set using the mean and variance of the training set. 
Apply GPC model with RBF kernel 



Perfo rmance (84 % area under curve) confirms our previous results ([Hansen. Mika. Schroeter. Sutter. Laa 
2009). Error rates can be obtained from Figure [6j 



Together with the prediction we calculated the explanation vector (as introduced in Section 
[2] with Definition [2]) for each test point. The remainder of this Section is an evaluation of 
these local explanations. 

In Figures [7] and [8] we show the distribution of the local importance of selected fea- 
tures across the test set: For each input feature we generate a histogram of local impor- 
tance values, as indicated by its corresponding entry in the explanation vector of each of 
the 4512 test compounds. The features examined in Figure [7] are counts of substructures 
kn own to cause mutagenicity. We sh ow all approved "specific toxicophores" introduced 
by Kazius. McGuire. and Bursi ( 20051 ) that are also represented in the Dragon set of fea- 



tur es. The features shown in F i gure [%] are known to detoxify certain toxicophores (again 



Kazius. McGuire. and Bursi . 2005l b With the exception of [7£e) the toxicophores also 



see 

have a toxifying influence according to our GPC prediction model. Feature [3(e) seems to 
be mostly irrelevant for the prediction of the GPC model on the test points. In contrast 
the detoxicophores show overall negative influence on the prediction outcome of the GPC 
model. Modifying the test compounds by adding toxicophores will increase the probabil- 
ity of being mutagenic as predicted by the GPC model while adding detoxicophores will 
decrease this predicted probability. 

So we have seen that the conclusions drawn from our explanation vectors agree with 
established knowledge about toxicophores and detoxicophores. While this is reassuring, 
such a sanity check required existing knowledge about which compounds are toxicophores 
and detoxicophores and which are not. Thus it is interesting to ask, whether we also could 
have discovered that knowledge from the explanation vectors. To answer this question 
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^ure 7: Distribution of local importance of selected featur es across the test set of 4512 
comp ounds. Nine out of ten known toxicophores (jKazius. McGuire. and Bursi . 
20051 ) indeed exhibit positive local gradients. 
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Figure 8: Distribution of local importance of selected features across the test set of 4512 
compounds. All five known detoxicophores exhibit negative local gradients 



we ranked all 142 features by the means of their local gradient^]. Clear trends result: 9 
out of 10 known toxicophores can be found close the top of the list (mean rank of 19). 
The only exception (rank 81) is the aromatic nitrosamine featurell This trend is even 
stronger for the detoxicophores: The mean rank of these five features is 138 (out of 142), i.e. 
they consistently exhibit the largest negative local gradients. Consequently, the established 
knowledge about toxicophores and detoxicophores could indeed have been discovered using 
our methodology. 

In the following paragraph we will discuss steroidal as an example of an important 
compound class for which the meaning of features differs from this global trend, so that 
local explanation vectors are needed to correctly identify relevant features. 

Figure [9] displays the difference in relevance of epoxide (a) and aliphatic nitrosamine (c) 
substructures for the predicted mutagenicity of steroids and non-steroid compounds. For 

7. Tables resulting from this ranking are made available as a supplement to this paper and can be down- 
loaded from the journals website. 

8. This finding agrees with the result obtained by visually inspecting Figure [TJe). We found that only very 
few compounds with this feature are present in the data set. Consequently, detection of this feature is 
only possible if enough of these few compounds are included in the training data. This was not the case 
in the random split used to produce the results presented above. 

9. Steroids are natural products and occur in humans, animals and plants. They have a characteristic 
backbone containing four fused carbon-rings. Many hormones important to the development of the 
human body are steroids, including androgens, estrogens, progestagens, cholesterol and natural anabolics. 
These have been used as starting points for the development of many different drugs, including the most 
reliable contraceptives currently on the market. 
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Figure 9: The local distribution of feature importance to steroids and random non-steroid 
compounds significantly differs for two known toxicophores. The small local gra- 
dients found for the steroids (shown in blue) indicate that the presence of each 
toxicophore is irrelevant to the molecules toxicity. For non-steroids (shown in 
red) the known toxicophores indeed exhibit positive local gradients. 
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comparison we also show the distributions for compounds chosen at random from the test 
set (b,d). Each subfigure contains two measures of (dis-)similarity for each pair of distribu- 
tions. The p-value of the Kolmogorov-Smirnoff test (KS) gives the probability of error when 
rejecting the hypothesis that both relative frequencies are drawn from the same underlying 
distribution. The symmetrized Kullback-Leibler divergence (KLD) gives a metric of the dis- 
tance between the two distributions While containing epoxides generally tends to make 
molecules mutagenic (see discussion above), we do not observe this effect for steroids: In 
Figure 0(a), almost all epoxide containing non-steroids exhibit positive gradients, thereby 
following the global distribution of epoxide containing compounds as shown in Figure [3(f). 
In contrast, almost all epoxide containing steroids exhibit gradients just below zero. "Im- 
munity" of s teroids to the epoxide tqxicop hore is an established fact and has first been 
discussed by Glatt. Jung, and Qesch ( 19831 ). This peculiarity in chemical space is clearly 



exhibited by the local explanation given by our approach. For aliphatic nitrosamine, the 
situation in the GPC model is less clear but still the toxifying influence seems to be less in 
steroids than in many other compounds. To our knowledge, this phenomenon has not yet 
been discussed in the pharmaceutical literature. 

In conclusion, we can learn from the explanation vectors that: 



toxicophores tend to make compounds mutagenic (class 1) 

• detoxicophores tend to make compounds non-mutagenic (class 0) 

• steroids are immune to the presence of some toxicophores (epoxide, possibly also 
aliphatic nitrosamine) 

7. Related Work 

Assigning potentially different explanations to individual data points distinguishes our ap- 
proach from conventional feature extraction methods that extract global features that are 
relevant for all data points, i.e. those features that allow to achieve a small overall prediction 
error. Our notion of explanation is not related to the prediction error, but only to the label 
provided by the prediction algorithm. Even though the error is large, our framework is able 
to answer the question why the algorithm has decided on a data point the way it did. 

The explanation vector proposed here is similar in spirit to sensitivity analysis which is 
common to v arious areas of information science. A classical exam ple is the outlier sensitivity 
statistics (jHampel. Ronchetti. Rousseeuw. and StahelLll98fih . In this case, the effects of 



m 

removing single data points on estimated parameters are evaluated by an influence function. 
If the influence for a data point is significantly large, it is detected as an outlier and should be 
removed for the following analysis. In regression problems, leverage analysis is a procedure 
along similar lines. It detects leverage points which have potential to give large impact 
on the estimate of the regression function. In contrast to the influential points (outliers), 
removing a leverage sample may not actually change the regressor, if its response is very 
close to the predicted value. E.g. for linear regression the samples whose inputs are far from 



10. Symmetry is achieved by av eraging the two Kullback-Leibler divergences: KL ( P1 < P2 )+ KL ( P2 < P1 ) ^ c f 
Ijohnson and Sinanovi 5(|200Ch. To prevent zero-values in the histograms which would lead to infinite KL 
distances, an e > has been added to each bin count. 
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the mean are the leverage points. Our framework of explanation vectors considers a different 
view. It describes the influence of moving single data points locally and it thus answers the 
question which directions are locally most influential to the prediction. The explanation 
vectors are used for extracting sensitive features which are relevant to the prediction results, 
rather than detecting/eliminating the influential samples. 

In recent decades, explanation of results by expert systems have been an important 
topic in the AI community. Especially, for those based on Bayesian belief networks, such 
explanation is crucial in pra ctical use. In this context se n sitivi ty analysis has also been 
used as a guiding principle ( Horvitz. Breese. and Henrionl . Il988l ). There the influence is 
evaluated by removing a set of variables (features) from evidences and the explanation is 
constructed from those variables which affect inference (relevant variables). For example, 
Suermondtl ()1992l ) measures the cost of omitting a single feature Ei by the cross-entropy 



H-(Et) = H(p(D\E); P(D\E\Ei) ) = £ P{d 5 \E) log 



3=1 



where E denotes evidences and D = (di, . . . , dpf) T is the target variable. The cost of a 
subset F C E can be defined similarly. This line of research is more connected to our work, 
because explanation can depend on the assigned values of the evidences E, and is thus local. 

Similarly Robnik-Sikonja and Kononenko ( 20081 ) and Strumbelj and Kononenkol ( 20081 ) 
try to explain the decision of trained kNN-, SVM- and ANN-models for individual instances 
by measuring the difference in their prediction with sets of features omitted. The cost of 
omitting features is evaluated as the information difference, the log-odds ratio or the dif- 
ference of probabilities between the model with knowledge about all features and with 
omissions respectively. To know what the prediction would be without the knowledge of 
a certain feature the model is retrained for every choice of features whose influence is to 
be exp lained. To save the time of combinatorial training iRobnik-Sikonja and Kononenko 
( 20081 ) propose to use neutral values which have to be estimated by a known prior distri- 
bution of all possible parameter values. As a theoretical framework for considering feature 



interactions, Strumbelj and Kononenkol ( 20081 ) propose to calculate the differences between 



model predictions for every choice of feature subset. 



For multi-layer perceptrons iFeraud and Clerotl (|2002l ) measure the importance of indi- 
vidual input variables on clusters of test points. Therefore the change in the model output 
is evaluated for the cha nge of a single input variab le in a chosen interval while all other 
input variables are fixed. Lemaire and Feraudl ( 20071 ) use a similar approach on an instance 
by instance basis. By considering each input vari able in turn there is no way to measur e 
input feature interactions on the model output (see LeCun. Bottou. Orr. and Miiller , 19981 ) . 

The principal differences between our approach and these frameworks are: (i) We con- 
sider continuous features and no structure among them is required, while some other frame- 
works start from binary features and may require discretization steps with the need to 
estimate parameters for it. (ii) We allow changes in any direction, i.e. any weighted com- 
bination of variables, while other approaches only consider one feature at a time or the 
omission of a set of variables. 
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8. Discussion 

By now we have shown that our methods for calculating / estimating explanation vectors 
are useful in a variety of situations. In the following we discuss their limitations. 

What can we do, if the derivative is zero? This situation is depicted in Figure [TOl In 
the lower panel we see a two-dimensional data set consisting of three clusters. The middle 
cluster has a different class than the clusters on the left and on the right. Relevant for 
the classification is only the horizontal coordinate (i.e. x\). The upper panel shows the 
projected data and a representative slice of C( x )- However, the explanation ((x) for the 
center point of the middle cluster is the zero vector, because at that point p(Y = 1\X = x) is 
maximal. What can we do in such situations? Actually, the (normalized) explanation vector 
is derived from the following optimization problem for finding the locally most influential 
direction: argmax|| e || =1 {p(Y ^g*(xo)\X = xq + e) — p(Y^g*(xo)\X = xq)}. In case that 
the first derivative of the above criterion is zero, its Taylor expansion starts from the second 
order term, which is a quadratic form in its Hessian matrix. In the example data set with 
three clusters, the explanation vector is constant along the second dimension. The most 
interesting direction is given by the eigenvector corresponding to the largest eigenvalue of 
the Hessian. This direction will be in our example along the first dimension. Thus, we can 
learn from the Hessian that the first coordinate is relevant for the classification, but we do 
not obtain an orientation for it. Instead it means that both directions (left and right) will 
influence the classification. However, if the conditional distribution P(Y = 1 | X = x) is flat 
in some regions, no meaningful explanation can be obtained by the gradient-based approach 
with the remedy mentioned above. Practically, by using Parzen window estimators with 
larger widths, the explanation vector can capture coarse structures of the classifier at the 
points which are not so far from the boarders. In lA.2.2l we give an illustration of this point. 
In the future, we would like to work on glo bal approaches, e.g. based on di s tance s to the 



boarders, or extensions of the approach by iRobnik-Sikonia and Kononenkd (|2008l ). Since 
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these procedures are expected to be computationally demanding, our proposal is useful in 
practice, in particular for probabilistic classifiers. 

Does our framework generate different explanations for different prediction 
models? When using the local gradient of the model prediction directly as in Defini- 
tion [2] and Section [61 the explanation follows the given model precisely by definition. For 
the estimation framework this depends on whether the different classifiers classify the data 
differently. In that case the explanation vectors will be different, which makes sense, since 
they should explain the classifier at hand, even if its estimated labels were not all correct. 
On the other hand, if the different classifiers agree on all labels, the explanation will be 
exactly equal. 

Which implicit limitations do analytical gradients inherit from Gaussian Process 
models? A particular phenomenon can be observed at the boundaries of the training data: 
Far from the training data, Gaussian Process Classification models predict a probability of 
0.5 for the positive class. When querying the model in an area of the feature space where 
predictions are negative, and one approaches the boundaries of the space populated with 
training data, explanation vectors will point away from any training data and therefore 
also away from areas of positive prediction. This behavior can be observed in Figure [T^d), 
where unit length vectors indicate the direction of explanation vectors. In the right hand 
side corner, arrows point away from the triangle. However, we can see that the length 
of these vectors is so small, that they are not even visible in Figure QJc). Consequently, 
this property of GPC models does not pose a restriction for identifying the locally most 
influential features by investigating the features with the highest absolute values in the 
respective partial derivatives, as shown in Section [6j 

Stationarity of the data. Since explanation vectors are defined as local gradients of the 
model prediction (see Definition [2]) , no assumption on the data is made: The local gradients 
follow the predictive model in any case. If, however, the model to be explained assumes 
stationarity of the data, the explanation vectors will inherit this limitation and reflect any 
shortcomings of the model (e.g. when the model is applied to non-stationary data). Our 
method for estimating explanation vectors, on the other hand, assumes stationarity of the 
data. 

When modeling data that is in fact non-stationary, appropriate measures to deal with 
such data sets should be taken. One option is to separate the feature space into sta- 
tionary and non-stationary parts using S tation ary Subspace Analysis as introduced by 



von Biinau. Meinecke. Kiralv. and Miillerl (|2009h. For further approach e s to data set shift 



seelSugivama. Nakaiima, Kashi ma. von Buenau. and Kawanabd (I2007bl) . IS ugiva ma. Krauledat. and Miiller 
( 2007a ) and the book by Quionero-Candela. Sugiyama. Schwaighofer. and Lawrence ( 20091 ). 



9. Conclusion 

This paper proposes a method that sheds light into the black boxes of nonlinear classifiers. 
In other words, we introduce a method that can explain the local decisions taken by arbitrary 
(possibly) nonlinear classification algorithms. In a nutshell, the estimated explanations are 
local gradients that characterize how a data point has to be moved to change its predicted 
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label. For models where such gradient information cannot be calculated explicitly, we 
employ a probabilistic approximate mimic of the learning machine to be explained. 

To validate our methodology we show how it can be used to draw new conclusions on 
how the various Iris flowers in Fisher's famous data set are different from each other and 
how to identify the features with which certain types of digits 2 and 8 in the USPS data set 
can be distinguished. Furthermore, we applied our method to a challenging drug discovery 
problem. The results on that data fully agree with existing domain knowledge, which was 
not available to our method. Even local peculiarities in chemical space (the extraordinary 
behavior of steroids) was discovered using the local explanations given by our approach. 

Future directions are two-fold: First we believe that our method will find its way into the 
tool boxes of practitioners who not only want to automatically classify their data but who 
also would like to understan d the learned classifier. Thus using our e xplan ation framework 
in computation biology (see ISonnenburg. Zien. Philips, and Ratschl. 120081) and in decision 
making experiments in psychophysics (e.g. lKienzle. Franz. Scholkopf. and Wichmann . 
seems most promising. The second direction is to generalize our approach to other prediction 
problems such as regression. 
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A. Appendix 

A.l Illustration of direct local gradients 

In the following we give some illustrative examples of our method to explain models using 
local gradients. Since the explanation is derived directly from the respective model, it 
is interesting to investigate its acuteness depending on different model parameters and in 
instructive scenarios. We examine the effects that local gradients exhibit when choosing 
different kernel functions, when introducing outliers, and when the classes are not linearly 
separable locally. 



A. 1.1 Choice of kernel function 

Figure QT] shows the effect of different kernel functions on the triangle toy data from Figure 
[T] in Section [2j The following observations can be made: 

• In any case note that the local gradients explain the model, which in turn may or may 
not capture the true situation. 



In Subfigure 11(a) the linear kernel leads to a model which fails to capture the non- 



linear class separation. This model misspecification is reflected by the explanations 



given for this model in Subfigure 11(b) 
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The rational quadratic kernel is able to more accurately model the non-linear sep- 
aration. In Subfigure 11(c) a non-optimal degree parameter has been chosen for 
illustrative purposes. For other parameter values the rational quadratic kernel leads 
to similar results as the RBF kernel function used in Figure [TJ 



• The explanations in Subfigure 11(d) obtained for this model show local perturbations 
at the small "bumps" of the model but the trends towards the positive class are still 
clear. As previously observed in Figure [TJ the explanations make clear that both 
features interact at the corners and on the hypotenuse of the triangle class. 

A. 1.2 Outliers 

In Figure [12] the effects of two outliers in the classification data to GPC with RBF kernel are 
shown. Once more, note that the local gradients explain the model, which in turn may or 
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(a) outliers in classes (b) outliers in model 
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(c) outlier explanation 
Figure 12: The effect of outliers to the local gradient explanations 

may not capture the true situation. The size of the region affected by the outliers depends 
on the kernel width parameter. We consider the following items: 

• Local gradients are in the same way sensitive to outliers as the model which they try 
to explain. Here a single outlier deforms the model and with it the explanation which 
may be extracted from it. 

• Being derivatives the sensitivity of local gradients to a nearby outlier is increased over 
the sensitivity of the model prediction itself. 

• Thus the local gradient of a point near an outlier may not reflect a true explanation 
of the features important in reality. Nevertheless it is the model here which is wrong 
around an outlier in the first place. 

• The histograms in the Figures El El and [9] in Section [6] show the trends of the respective 
features in the distribution of all test points and are thus not affected by single outliers. 
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To compensate for the effect of outliers to the local gradients of points in the affected 
region we propose to use a sliding window method to smooth the gradients around each 
point of interest. Thus for each point use the mean of all local gradients in the hypercube 
centered at this point and of appropriate size. This way the disrupting effect of an outlier 
is averaged out for an appropriately chosen window size. 

A. 1.3 Local non-linearity 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 



(c) locally non-linear explanation 
Figure 13: The effect of local non-linearity to the local gradient explanations 

The effect of locally non- linear class boundaries in the data is shown in Figure [13] again 
for GPC with an RBF kernel. The following points can be observed: 

• All the non-linear class boundaries are accurately followed by the local gradients 

• The circle shaped region of negative examples surrounded by positive ones shows the 
full range of feature interactions towards the positive class 
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• On the ridge of single positive instances the model introduces small valleys which are 
reflected by the local gradients 



A. 2 Estimating by Parzen window 

Finally we elaborate on some details of our estimation approach of local gradients by Parzen 
window approximation. First we give the derivation to obtain the explanation vector and 
second we examine how the explanation varies with the goodness of fit of the Parzen window 
method. 

A. 2.1 Derivation of explanation vectors 

These are more details on the derivation of Eq. ([3|). We use the index set I c = {i\ g(xi) = c}: 
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and thus for the index set I g ( z ) = {i \ g( x i) = g(z)} 
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A. 2. 2 Goodness of fit by Parzen window 

In our estimation framework the quality of the local gradients depends on the approximation 
of the classifier we want to explain by Parzen windows for which we can calculate the 
explanation vectors as given by Definition [3j 




(a) SVM model (b) estimated explanation with a = 0.00069 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 



(c) estimated explanation with a = 0.1 (d) estimated explanation with a = 1.0 

Figure 14: Good fit of Parzen window approximation affects the quality of the estimated 
explanation vectors 



Figure 14(a) shows an SVM model trained on the classification data from Figure 13(a) 
The local gradients estimated for this model by different Parzen window approximations 
are depicted in Subfigures 14(b) , 14(c) , and 14(d) We observe the following points: 



• The SVM model was trained with C = 10 and using an RBF kernel of width a = 0.01 



In Subfigure 14(b) a small window width has been chosen by minimizing the mean 
absolute error over the validation set of labels predicted by the SVM classifier. Thus we 
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obtain explaining local gradients on the class boundaries but zero vectors in the inner 
class regions. While this resembles the piecewise flat SVM model most accurately it 
may be more useful practically to choose a larger width to obtain non-zero gradients 
pointing to the borders in this regions as well. For a more detailed discussion of zero 
gradients see Section El 



A larger width practically useful in this example is shown in Subfigure 14(c) Here 
the local gradients in the inner class regions point to the other class as well. 



For a too large window width in Subfigure 14(d) the approximation fails to obtain 
local gradients which closely follow the model. Here only two directions are left and 
the gradients for the blue class on the left and on the bottom point in the wrong 
direction. 
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